Fix hallucinations during silence #2629
When the predicted tokens end with a single timestamp, the entire 30-second segment should be considered done, to avoid hallucinations for the remaining part of the segment. This behaviour is on par with OpenAI's whisper. Refer to the logic related to `single_timestamp_ending` in https://github.com/openai/whisper/blob/main/whisper/transcribe.py
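For readers who want the rule spelled out, here is a minimal sketch of the check described above. It is not the code from this PR: `ends_with_single_timestamp`, `next_seek_delta`, and the parameter names are hypothetical, and it assumes that any token id greater than or equal to `timestamp_begin` is a timestamp token and that seek positions are counted in frames.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not whisper.cpp's API).
// Assumption: token ids >= timestamp_begin are timestamp tokens.
// True when the last token is a timestamp and the one before it is not,
// i.e. the sequence ends with a single, unpaired timestamp.
bool ends_with_single_timestamp(const std::vector<int32_t> & tokens, int32_t timestamp_begin) {
    const size_t n = tokens.size();
    if (n < 2) {
        return false;
    }
    const bool last_is_ts = tokens[n - 1] >= timestamp_begin;
    const bool prev_is_ts = tokens[n - 2] >= timestamp_begin;
    return last_is_ts && !prev_is_ts;
}

// Hypothetical helper: how far to advance the seek position, in frames.
//   window_frames      - length of the full decoder window
//   last_closed_frames - offset of the last completed segment inside the window
// With a single-timestamp ending the whole window is treated as consumed;
// otherwise only the part up to the last completed segment is consumed.
int next_seek_delta(const std::vector<int32_t> & tokens,
                    int32_t timestamp_begin,
                    int window_frames,
                    int last_closed_frames) {
    assert(window_frames > 0);
    assert(last_closed_frames >= 0 && last_closed_frames <= window_frames);
    if (ends_with_single_timestamp(tokens, timestamp_begin)) {
        return window_frames;
    }
    return last_closed_frames;
}
```

The sketch leaves out the case where the decoder emits no paired timestamps at all, which the reference implementation handles separately.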
We need this so bad. Hopefully it'll work with the Swift package?
@itsthisjustin Yes, of course. The fix is done in the core.
Gonna test this. Here is 1.7.2:
[00:01:07.360 --> 00:01:07.820] Father, for all-in-sacrifice, for all-in-sacrifice, for all-in-sacrifice, for all-in-sacrifice,
output_srt: saving output to '0155.srt'
Now let's see with the patch... downloaded the new whisper.cpp in src.
2000 ( 3m chapters ) = 6,000 minutes or 100 hours
Duration of audiobook 294660 seconds
Total number of chapters: 187
Average length of chapters 1625 chunks for ~182 secs or 00h:03m:02s splits
@mrfragger And here is the output with the fixed branch. Command line: Please note that the extra hallucinations are removed in this branch.
@mrfragger |
It's a really bad audio recording of a conversation, that portion. Anyway, most of the time I eliminate all silence before compiling the audiobook to transcribe, and I trim music intros and outros if feasible. I believe your patch addresses the silence, so if it does indeed work for that it would be a huge boon. So far I've been running your patch for the last 6 or 7 hours with no negative effects or anything unusual.
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
# By Georgi Gerganov (4) and others
# Via GitHub
* ggerganov/master:
  stream : improve consistency in README (ggml-org#2642)
  whisper : support no_speech_thold (ggml-org#2625)
  whisper : add single-timestamp logic (ggml-org#2629)
  readme : fix typo (ggml-org#2637)
  cmake : fix "amd64" processor string (ggml-org#2638)
  vulkan : fix soft_max.comp division by zero (ggml-org#2633)
  common : add cstdio header
  stream : update build instructions
  android : fix build and ci (ggml-org#2624)
  models : fix typo in download-ggml-model.sh (ggml-org#2623)
  ruby : Sync whisper.cpp and model download feature (ggml-org#2617)
  scripts : update to new build system

# Conflicts:
#	src/whisper.cpp
When a specific language is forced (e.g. -l ru, -l es) and a 30-second decoder window is entirely zero-valued, whisper emits language-specific fallback tokens (bracketed music tags like [Música], fake subtitle-editor credits on -l ru). The auto-detect path handles silent chunks naturally.

Add a chunk-level zero-PCM check at the top of the seek loop inside whisper_full_with_state. When the current window is all-zero and the caller forced a language, emit a single [BLANK_AUDIO] segment for that chunk and advance without running the encoder or decoder.

Matches the approach endorsed in PR ggml-org#1588 review ("skip entire segments when silence is detected"), using zero-PCM as a stricter and language-independent signal than no_speech_prob.

The caller's original language intent is captured before the auto-detect block overwrites params.language, so the guard only fires when the user explicitly requested a specific language; auto-detect paths are unchanged.

Fixes ggml-org#1724 (residual hallucination on forced-language silence chunks not addressed by ggml-org#2629)
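
The guard described above could look roughly like the sketch below. It is not the patch itself: `window_is_all_zero` and `should_skip_chunk` are hypothetical names, the PCM buffer is assumed to hold 32-bit float samples, and treating an empty window as "not silent" is a choice made for this sketch, not something the commit message specifies.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not whisper.cpp's API).
// True when every sample in [offset, offset + count), clamped to the buffer,
// is exactly 0.0f. An empty window returns false (assumption of this sketch).
bool window_is_all_zero(const std::vector<float> & pcm, size_t offset, size_t count) {
    assert(offset <= pcm.size());
    const size_t end = std::min(pcm.size(), offset + count);
    for (size_t i = offset; i < end; ++i) {
        if (pcm[i] != 0.0f) {
            return false;
        }
    }
    return end > offset;
}

// Hypothetical guard: skip the encoder and decoder for this chunk only when
// the caller forced a language AND the window is entirely zero-valued.
// language_forced must be captured before auto-detection overwrites the
// language parameter, as the commit message explains.
bool should_skip_chunk(bool language_forced,
                       const std::vector<float> & pcm,
                       size_t offset,
                       size_t count) {
    return language_forced && window_is_all_zero(pcm, offset, count);
}
```

An exact comparison against zero is deliberately strict: it matches digitally silent padding but not quiet room noise, which is consistent with the commit's description of zero-PCM as a stricter signal than no_speech_prob.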